Writing a System Agent
Learn how to design a system agent in Go.
So far, when we have automated operations on a device, we have either done it from an application that executes locally or through a command we run remotely with SSH. But if we look toward managing a small fleet of machines, it can be more practical to write a service that runs on the device that we connect to via RPCs. Using knowledge of the gRPC services we discussed in previous chapters, we can combine these concepts to allow control of our machines in a more uniform way.
Here are a few things we can use system agents for:
Installing and running services.
Gathering machine running stats.
Gathering machine inventory information.
Some of these are the kinds of things Kubernetes does with its system agents. Others, such as gathering inventory information, are vital to running a healthy fleet of machines but are often overlooked in smaller settings. Even in a Kubernetes environment, there may be advantages to running our own agent for certain tasks.
A system agent can provide several advantages. If we define one application programming interface (API) using gRPC, we can have multiple OSs with different agents implementing the same RPCs, allowing us to control our fleet in the same uniform way, regardless of the OS. And because Go will pretty much run on anything, we can write different agents using the same language.
Designing a system agent#
For our example system agent, we are going to target Linux specifically, but we'll make our API generic to allow implementation for other OSs to use the same API. Let's talk about a few things we might be interested in. We could consider the following:
Installing/removing binaries using systemd.
Exporting both system and installed binary performance data.
Allowing the pulling of application logs.
Containerizing our application.
For those of us not familiar with systemd, it is a Linux daemon that runs software services in the background. Taking advantage of systemd allows us to have automatic restarts of failed applications and automatic log rotation with journald.
Containerization, for those not familiar with the concept, executes an application within its own self-contained space with access to only the parts of the OS we want. This is a similar concept to what is called sandboxing. Containerization has been made popular by software such as Docker and has led to container formats that look like VMs with entire OS images within a container. However, these container formats and tooling are not required to containerize an application on Linux.
As we are going to use systemd to control our process execution, we will use the Service directives of systemd to provide containerization. These details can be seen in our repository here.
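For illustration, a user-level unit file might combine a few such Service directives. This is a hedged sketch: the directive values below are illustrative, and the repository's unit template differs in its details (some sandboxing directives also have limited support in user-level units, depending on the systemd version).

```ini
# Hypothetical unit file for a package installed by the agent.
[Unit]
Description=Example service installed by the system agent

[Service]
ExecStart=/home/agent/sa/packages/myservice/myservice
# Service directives that sandbox ("containerize") the process:
ProtectSystem=strict     # mount most of the filesystem read-only for the service
PrivateTmp=true          # give the service private /tmp and /var/tmp
NoNewPrivileges=true     # the process and its children cannot gain privileges
Restart=always           # automatic restarts of failed applications

[Install]
WantedBy=default.target
```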
For exporting stats, we'll use the expvar Go standard library package. This package allows us to publish stats on an HTTP page. expvar stats are a JSON object with string keys that map to values representing our stats or information. There are built-in stats automatically provided, along with ones we'll define. This allows us to quickly gather stat data using a collector or by simply querying it with a web browser or a command-line tool such as wget.
An example expvar page might return the following:
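As a rough illustration, a trimmed page showing a couple of expvar's built-in variables could look like this (the values are made up):

```json
{
  "cmdline": ["/home/agent/sa/agent"],
  "memstats": {
    "Alloc": 1404176,
    "TotalAlloc": 2147192,
    "Sys": 71387400
  }
}
```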
For this lesson, we are going to concentrate on installing and removing binaries and exporting system performance data to show how we can use our RPC service for interactive calls and HTTP for read-only information. The version in our repository will implement more features than we can cover in this lesson.
Now that we've talked about what we want the system agent to do, let's design our proto for our service, as follows:
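Since the proto file itself isn't reproduced here, the following is a hedged sketch of what such a service definition could look like. The package, message, and RPC names are illustrative stand-ins, not the repository's exact definitions:

```protobuf
syntax = "proto3";

package agent;

option go_package = "example.com/agent/proto";

message InstallReq {
  string name = 1;           // a single name of letters and numbers only
  bytes package = 2;         // a ZIP archive of the package contents
  string binary = 3;         // the binary in the root directory to execute
  repeated string args = 4;  // arguments to pass to the binary
}

message InstallResp {}

message RemoveReq {
  string name = 1;
}

message RemoveResp {}

service Agent {
  // Installs a package and runs it under systemd.
  rpc Install(InstallReq) returns (InstallResp);
  // Turns down and removes an installed package.
  rpc Remove(RemoveReq) returns (RemoveResp);
}
```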
We now have a general framework for our RPCs, so let's look at implementing a method for our Install RPC.
Implementing Install#
Implementing installations on Linux will require a multi-step process. First, we are going to install the package under sa/packages/[InstallReq.Name] in the agent's user home directory. InstallReq.Name will need to be a single name, containing only letters and numbers. If that name already exists, we'll turn down the existing job and install this in its place. InstallReq.Package on Linux will be a ZIP file that will be unpacked in that directory.
InstallReq.Binary is the name of the binary in the root directory to execute. InstallReq.Args is a list of arguments to pass to the binary.
We'll be using a third-party package to access systemd. We can find the package here on GitHub.
Let's look at the implementation here:
This code does the following:
Lines 1–3: Validates our incoming request to ensure it is valid.
Lines 5–6: Takes a lock for this specific install name.
This prevents multiple installs with the same name at the same time.
Line 7: Unpacks our ZIP file into a temporary directory.
Returns the location of the temporary directory.
Validates that our req.Binary binary exists.
Line 12: Migrates our temporary directory to our req.Name location.
If a systemd unit already exists, it is turned down.
Creates a systemd unit file under /home/[user]/.config/systemd/user/.
If the final path already exists, deletes it.
Moves the temporary directory to the final location.
Line 15: Starts our binary.
Makes sure it is up and running for 30 seconds.
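The validation and per-name locking steps above can be sketched as follows. The types and names here are simplified stand-ins for the repository's code, not its exact implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// InstallReq is a simplified stand-in for the proto's request message.
type InstallReq struct {
	Name   string
	Binary string
	Args   []string
}

// nameRE enforces that Name is a single token of letters and numbers only.
var nameRE = regexp.MustCompile(`^[a-zA-Z0-9]+$`)

// validate rejects malformed requests before any filesystem work happens.
func validate(req InstallReq) error {
	if !nameRE.MatchString(req.Name) {
		return fmt.Errorf("invalid install name %q: must contain only letters and numbers", req.Name)
	}
	if req.Binary == "" {
		return fmt.Errorf("binary must be set")
	}
	return nil
}

// locks holds one mutex per install name, so two installs of the same
// name serialize while installs of different names run in parallel.
var locks sync.Map // map[string]*sync.Mutex

// lockName returns the mutex for a given install name, creating it if needed.
func lockName(name string) *sync.Mutex {
	mu, _ := locks.LoadOrStore(name, &sync.Mutex{})
	return mu.(*sync.Mutex)
}

func main() {
	for _, req := range []InstallReq{
		{Name: "myservice", Binary: "myservice"},
		{Name: "bad/name", Binary: "x"},
	} {
		if err := validate(req); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		mu := lockName(req.Name)
		mu.Lock()
		fmt.Println("installing:", req.Name) // unpack the ZIP, write the unit file, etc.
		mu.Unlock()
	}
}
```

The sync.Map keyed by install name gives each package its own mutex, which is what lets installs of different packages proceed concurrently.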
This is a simple example of the setup for our gRPC service to set up and run a service with systemd. We are skipping various implementation details, but you can find them inside the repository listed toward the end of the lesson.
Now that we have Install done, let's work on implementing SystemPerf.
Implementing SystemPerf#
To gather our system information, we'll be using the goprocinfo package, which we can find here on GitHub.
We want this to update us about every 10 seconds, so we'll implement our gathering in a loop where all callers read from the same data. Let's start by collecting our central processing unit (CPU) data for our system, as follows:
This code does the following:
Line 2: Reads our CPU state data.
Lines 7–22: Writes it to a protocol buffer.
Line 24: Stores the data in .cpuData.
.cpuData will be of the atomic.Value type. This type is useful when we wish to synchronize an entire value rather than mutate it. Every time we update .cpuData, we store a new value into it. If we store a struct, map, or slice in an atomic.Value, we cannot change a single key, index, or field; we MUST make a new copy with all keys/indexes/fields and store that instead. For small values, this is much faster to read than using a mutex, which is perfect when storing a small set of counters.
The collectMem memory collector is similar to collectCPU and is detailed in the repository code.
Let's have a look at the loop that will be started in our New() constructor for gathering perf data, as follows:
This code does the following:
Line 3: Collects our initial CPU stats.
Lines 6–13: Publishes an expvar.Var type for system-cpu.
Our variable type is func() interface{}, which implements expvar.Func.
This simply reads our atomic.Value set by our collectCPU() function.
A read occurs when someone queries our web page at /debug/vars.
Line 17: Refreshes our collections every 10 seconds.
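The publish-and-refresh pattern can be sketched as follows. A plain map stands in for the proto message so the sketch stays self-contained, and the variable names are hypothetical:

```go
package main

import (
	"expvar"
	"fmt"
	"sync/atomic"
	"time"
)

// cpuData holds the latest collection; every reader sees a consistent snapshot.
var cpuData atomic.Value

// collectCPU would normally read /proc/stat (e.g., via goprocinfo) and build a
// proto message; a placeholder map stands in here.
func collectCPU() {
	cpuData.Store(map[string]int64{"user": 100, "system": 50})
}

func main() {
	collectCPU() // initial collection

	// Publish a func() interface{} (an expvar.Func) under system-cpu; it is
	// invoked each time someone queries /debug/vars.
	expvar.Publish("system-cpu", expvar.Func(func() interface{} {
		return cpuData.Load()
	}))

	// Refresh the collection every 10 seconds in the background.
	go func() {
		for range time.Tick(10 * time.Second) {
			collectCPU()
		}
	}()

	fmt.Println(expvar.Get("system-cpu").String()) // {"system":50,"user":100}
}
```

Because the expvar.Func only loads from the atomic.Value, queries to /debug/vars never touch the stat files directly; only the background loop does.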
expvar defines other simpler types such as String, Float, Map, and so on. However, we prefer using protocol buffers over Map for grouping content in a single, sharable message type that can be used in any language. Because a proto is JSON-serializable, it can be used as the return value for an expvar.Func with a little help from the protojson package. In the repository, that helper code is in agent/proto/extra.go.
This code only shares the latest data collection. It is important not to read directly from stat files on each call, as doing so could easily overload our system.
When we go to the /debug/vars web endpoint, we can now see the following:
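A trimmed version of that output might look like the following; the field names depend on the proto definition, so treat these as illustrative:

```json
{
  "system-cpu": {
    "user": 4096,
    "system": 1024,
    "idle": 793600,
    "ioWait": 120
  }
}
```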
There will be other stats there that are for the system agent itself, which can be useful in debugging the agent. These are automatically exported by expvar. By using a collector that connects and reads these stats, it is possible to see trends for these stats over time.
We now have an agent that gets perf data every 10 seconds, giving us a functioning system agent. It is worth noting that we have shied away from talking about authentication, authorization, and accounting (AAA) when talking about RPC systems. gRPC has support for Transport Layer Security (TLS) to both secure the transport and allow for mutual TLS. We can also implement a user/password, Open Authorization (OAuth), or any other AAA system we are interested in.
Web services can implement their own security for things such as expvar. expvar publishes its stats on /debug/vars, and it is a good idea not to expose these to the outside world. Either prevent the export on all load balancers or implement some type of security on the endpoint.
You can find the complete code for our system agent here. When the service runs, its output should indicate that it is starting.
In our completed code, we have decided to implement our system agent over SSH. This allows us to use an authorization system we already have with strong transport security. In addition, the gRPC service is exporting services over a private Unix domain socket, so local services that are not root cannot access the service.
We'll also find code that containerizes the applications we install via systemd directives. This provides native isolation to help protect the system.
In this lesson, we learned the possible uses of a system agent, a basic design guide to building one, and finally walked through the implementation of a basic agent on Linux. We also discussed how our gRPC interface is designed to be generic, to allow for the implementation of the agent for other OSs.
As part of building the agent, we have given a brief introduction to exporting variables with expvar.